On the Mono- and Cross-Language Detection of Text Re-Use and Plagiarism

Authors

  • Luis Alberto Barrón
  • Paolo Rosso
  • Fabio Crestani
  • Mark Fritz
Abstract

Automatic text re-use detection is the task of determining whether a text has been produced by considering another as its source. Plagiarism, the unacknowledged re-use of text, is probably the most famous kind of re-use. Favoured by the easy access to information through electronic media, plagiarism has risen in recent years, demanding the attention of experts in text analysis. Automatic text re-use detection draws on natural language processing and information retrieval technology to compare thousands of documents in search of the potential source of a presumed case of re-use. Machine translation technology can be used to uncover cases of cross-language text re-use. By exploiting such technology, thousands of exhaustive comparisons become possible, also across languages, something impossible to do manually. In this dissertation we pay special attention to three types of text re-use, namely: (i) cross-language text re-use, (ii) paraphrase text re-use, and (iii) mono- and cross-language re-use within and from Wikipedia. In the case of cross-language text re-use, we propose a cross-language similarity assessment model based on statistical machine translation. The model is exhaustively compared to the other models available to date and proves to be one of the best options when looking for exact translations, regardless of whether they are automatically or manually created. In the case of paraphrase, the core of plagiarism, we investigate which types of paraphrase plagiarism cases are most difficult to detect. Our analysis of plagiarism detection from the perspective of paraphrasing represents something never done before. Our insights include that lexical changes are the most common paraphrasing strategy when plagiarising. These findings should be integrated into future generations of plagiarism detectors.

Similar resources

English-Persian Plagiarism Detection based on a Semantic Approach

Plagiarism, which is defined as “the wrongful appropriation of other writers’ or authors’ works and ideas without citing or informing them”, poses a major challenge to knowledge spread publication. Plagiarism has been placed in four categories: direct, paraphrasing (rewriting), translation, and combinatory. This paper addresses translational plagiarism which is sometimes referred to as cross-li...

Plagiarism checker for Persian (PCP) texts using hash-based tree representative fingerprinting

With due respect to the authors’ rights, plagiarism detection is one of the critical problems in the field of text mining that many researchers are interested in. This issue is considered a serious one in higher academic institutions. There exist language-free tools which do not yield any reliable results since the special features of every language are ignored in them. Considering the paucit...

PAN@FIRE: Overview of the Cross-Language !ndian Text Re-Use Detection Competition

The development of models for automatic detection of text re-use and plagiarism across languages has received increasing attention in recent years. However, the lack of an evaluation framework composed of annotated datasets has caused these efforts to be isolated. In this paper we present the CL!TR 2011 corpus, the first manually created corpus for the analysis of cross-language text re-use b...

External Plagiarism Detection based on Human Behaviors in Producing Paraphrases of Sentences in English and Persian Languages

With the advent of the internet and easy access to digital libraries, plagiarism has become a major issue. Applying search engines is one of the plagiarism detection techniques that converts plagiarism patterns to search queries. Generating suitable queries is the heart of this technique and existing methods suffer from lack of producing accurate queries, Precision and Speed of retrieved result...

Deep Investigation of Cross-Language Plagiarism Detection Methods

This paper is a deep investigation of cross-language plagiarism detection methods on a recently introduced open dataset, which contains parallel and comparable collections of documents with multiple characteristics (different genres, languages and sizes of texts). We investigate cross-language plagiarism detection methods for 6 language pairs on 2 granularities of text units in order to dra...

Publication date: 2012